{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Use ChromaDBQueryEngine to query Markdown files \n",
    "\n",
    "This notebook demonstrates the use of the `ChromaDBQueryEngine` for retrieval-augmented question answering over documents. It shows how to set up the engine with Docling parsed Markdown files, and execute natural language queries against the indexed data. \n",
    "\n",
    "The `ChromaDBQueryEngine` integrates persistent ChromaDB vector storage with LlamaIndex for efficient document retrieval.\n",
    "\n",
    "You can create and add this ChromaDBQueryEngine to [DocAgent](https://docs.ag2.ai/latest/docs/user-guide/reference-agents/docagent) to use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-vector-stores-chroma==0.4.1\n",
    "%pip install llama-index==0.12.16"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load LLM configuration\n",
    "\n",
    "This demonstration requires an `OPENAI_API_KEY` to be in your environment variables. See [our documentation](https://docs.ag2.ai/latest/docs/user-guide/advanced-concepts/llm-configuration-deep-dive) for guidance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import autogen\n",
    "\n",
    "config_list = autogen.config_list_from_json(env_or_file=\"../OAI_CONFIG_LIST\")\n",
    "\n",
    "assert len(config_list) > 0\n",
    "print(\"models to use: \", [config_list[i][\"model\"] for i in range(len(config_list))])\n",
    "\n",
    "# Put the OpenAI API key into the environment\n",
    "os.environ[\"OPENAI_API_KEY\"] = config_list[0][\"api_key\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Refer to this [link](https://docs.trychroma.com/production/containers/docker) for running Chromadb in a Docker container.\n",
    "    If the host and port are not provided, the engine will create an in-memory ChromaDB client.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from autogen.agentchat.contrib.rag.chromadb_query_engine import ChromaDBQueryEngine\n",
    "\n",
    "query_engine = ChromaDBQueryEngine(\n",
    "    host=\"host.docker.internal\",  # Change this to the IP of the ChromaDB server\n",
    "    port=8000,  # Change this to the port of the ChromaDB server\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we can see the default collection name in the vector store, this is where all documents will be ingested. When creating the `ChromaDBQueryEngine` you can specify a `collection_name` to ingest into."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(query_engine.get_collection_name())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's ingest a document and query it.\n",
    "\n",
    "If you don't have your documents ingested yet, follow the next two cells. Otherwise skip to the `connect_db` cell.\n",
    "\n",
    " `init_db` will overwrite the existing collection with the same name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_dir = (\n",
    "    \"/workspaces/ag2/test/agents/experimental/document_agent/pdf_parsed/\"  # Update to match your input directory\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_docs = [input_dir + \"nvidia_10k_2024.md\"]  # Update to match your input documents\n",
    "query_engine.init_db(new_doc_paths_or_urls=input_docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the given collection already has the document you need, you can use ```connect_db``` to avoid overwriting the existing collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# query_engine.connect_db()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "question = \"How much money did Nvidia spend in research and development\"\n",
    "answer = query_engine.query(question)\n",
    "print(answer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great, we got the data we needed. Now, let's add another document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_docs = [input_dir + \"Toast_financial_report.md\"]\n",
    "query_engine.add_docs(new_doc_paths_or_urls=new_docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And query again from the same database but this time for another corporate entity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "question = \"How much money did Toast earn in 2024?\"\n",
    "answer = query_engine.query(question)\n",
    "print(answer)"
   ]
  }
 ],
 "metadata": {
  "front_matter": {
   "description": "ChromaDB Query Engine for document queries",
   "tags": [
    "agents",
    "documents",
    "RAG",
    "docagent",
    "chroma",
    "chromadb"
   ]
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.21"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
